A COVID-19 mortality prediction model for Korean patients using nationwide Korean disease control and prevention agency database

The early nationwide COVID-19 outbreak in South Korea led to an early shortage of medical resources. For efficient resource allocation, accurate prediction of the prognosis or mortality of confirmed patients is essential. Therefore, the aim of this study was to develop an accurate model for predicting COVID-19 mortality using epidemiological and clinical variables and for identifying a high-risk group of confirmed patients. Clinical and epidemiological variables of 4049 patients with confirmed COVID-19 between January 20, 2020 and April 30, 2020 collected by the Korean Disease Control and Prevention Agency were used. Among the 4049 total confirmed patients, 223 patients died, while 3826 patients were released from isolation. Patients who had the following risk factors showed significantly higher risk scores: age over 60 years, male sex, difficulty breathing, diabetes, cancer, dementia, change of consciousness, and hospitalization in the intensive care unit. High accuracy was shown for both the development set (n = 2467) and the validation set (n = 1582), with AUCs of 0.96 and 0.97, respectively. The prediction model developed in this study based on clinical features and epidemiological factors could be used for screening high-risk groups of patients and for evidence-based allocation of medical resources.

www.nature.com/scientificreports/ models regarding COVID-19, these proposed models are poor, with a high risk of bias due to the lack of external validation 5 . Since South Korea is geopolitically close to China, it was one of the countries most affected by COVID-19 during the early stage of the pandemic. Indeed, Korea experienced an explosive outbreak in the first two months after the first confirmed patient was detected on January 20, 2020 12 . A mortality prediction model using a machine learning method based on sociodemographic and medical information from national health insurance data has been proposed 13 . However, it focused on socioeconomic variables as predictors rather than clinical and epidemiological factors. Clinical experience and epidemiological characteristics have been reported as major factors associated with heterogeneity of prognosis after COVID-19 confirmation 14 . Therefore, the aim of this study was to establish a COVID-19 mortality prediction model using clinical and epidemiological variables nationally collected by the Central Disease Control Headquarters.

Results
Baseline characteristics. Between January 20, 2020, when the first patient was confirmed with COVID-19, and April 30, 2020, 4049 patients recorded in the government database had either been released from quarantine or had died. Among these 4049 patients, the case mortality was 5.51% (223 deaths and 3826 recoveries).
We compared the distribution of patients according to epidemiological and clinical characteristics. We also conducted logistic regression analyses of mortality without adjustment (univariable) and with adjustment for covariates (multivariable). Results are shown in Table 1. In univariable analysis, age over 40, male sex, runny nose, and headache significantly increased the risk of mortality, while abnormal changes in consciousness (ACC), diabetes, hypertension, cancer history, dementia, and hospitalization in the intensive care unit were protective. In the multivariable analysis after adjusting for covariates, age over 40 years and having a runny nose remained significant risk factors for mortality. Protective variables remained protective after adjusting for covariates.
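For a single binary factor, the unadjusted (univariable) odds ratio from logistic regression equals the cross-product ratio of the corresponding 2x2 table, so it can be illustrated without fitting a model. The following is a minimal sketch under that assumption; the function name and the counts are hypothetical and are not taken from Table 1.

```python
import math


def odds_ratio_wald(a, b, c, d, z=1.96):
    """Unadjusted odds ratio with a Wald confidence interval from a 2x2 table.

    a: factor present, died      b: factor present, survived
    c: factor absent, died       d: factor absent, survived
    z: normal quantile (1.96 gives an approximate 95% interval)
    """
    odds_ratio = (a * d) / (b * c)
    # Standard error of log(OR) from the four cell counts.
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


# Hypothetical counts for illustration only; NOT the study data.
odds_ratio, lower, upper = odds_ratio_wald(a=40, b=160, c=20, d=780)
print(f"OR {odds_ratio:.2f}, 95% CI {lower:.2f}-{upper:.2f}")
```

An interval that excludes 1 corresponds to a factor reported as significant; values above 1 indicate higher odds of death and values below 1 indicate lower odds. The multivariable estimates require a fitted model and are not reproduced by this sketch.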
Factors associated with mortality from COVID-19. Table 2 summarizes differences in clinical characteristics for continuous variables and the risk of COVID-19 mortality per 1-unit increase of each clinical variable. Heart rate intensity (OR 1.03, 95% CI 1.02-1.04) and temperature (OR 1.94, 95% CI 1.55-2.43) were associated with an increased risk of COVID-19 mortality. Higher levels of hemoglobin, hematocrit, and lymphocytes were associated with a significantly lower risk of mortality. Based on the exploratory analysis results shown in Tables 1 and 2, a prediction model for the development set was established, as shown in Table 3. The odds ratio (regression coefficient) of mortality risk was determined to produce a risk score.
Performance of prediction model. We applied our risk score to our total set, the development set, and the validation set. Figures 1, 2, and 3 compare the predicted mortality with the actual mortality by risk score decile for the total set, the development set, and the validation set, respectively. The performance on each dataset was evaluated using ROC curves. Results are shown in Fig. 4. Our prediction model showed good performance for both the development set and the validation set, with areas under the curve of 0.9656 and 0.9684, respectively.
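The two evaluations described here, predicted versus observed mortality by risk decile and the area under the ROC curve, can be sketched as follows. This is an illustrative sketch, not the authors' SAS code: the function names are hypothetical, the AUC uses the rank-based (Mann-Whitney) formulation, and the ten-patient input is invented toy data.

```python
def auc(scores, labels):
    """AUC as the probability that a randomly chosen death has a higher
    score than a randomly chosen survivor (ties count as one half)."""
    deaths = [s for s, y in zip(scores, labels) if y == 1]
    survivors = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in deaths for n in survivors)
    return wins / (len(deaths) * len(survivors))


def decile_table(probs, labels, bins=10):
    """Mean predicted and observed mortality in risk-ordered groups.

    Returns rows of (group number, group size, predicted, observed).
    """
    order = sorted(range(len(probs)), key=lambda i: probs[i])
    rows = []
    for k in range(bins):
        idx = order[k * len(order) // bins:(k + 1) * len(order) // bins]
        if not idx:
            continue  # fewer patients than bins
        predicted = sum(probs[i] for i in idx) / len(idx)
        observed = sum(labels[i] for i in idx) / len(idx)
        rows.append((k + 1, len(idx), predicted, observed))
    return rows


# Toy data for illustration only (1 = died, 0 = released from isolation).
probs = [0.01, 0.02, 0.03, 0.05, 0.08, 0.10, 0.20, 0.40, 0.70, 0.90]
labels = [0, 0, 0, 0, 0, 0, 1, 0, 1, 1]

toy_auc = auc(probs, labels)
rows = decile_table(probs, labels)
```

A well-calibrated model shows predicted and observed mortality that track each other across groups, while the AUC summarizes only how well the score ranks deaths above survivors.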

Discussion
Our study developed and validated a COVID-19 mortality prediction model based on clinical and epidemiological data of 4049 confirmed COVID-19 patients recruited by Korea Centers for Disease Control and Prevention. The high AUC value of 0.9684 indicated the good reliability and performance of our model. The clinical course of coronavirus infection ranges from asymptomatic infection to acute respiratory distress syndrome (ARDS) and death. The longer the COVID-19 global pandemic lasts, the earlier a shortage of medical resources occurs.
[...] based on medical chart review 15 . The average age of mortality cases was 72 years. Of these mortality cases, 55.1% were women, and 74.5% had an underlying disease. The median time from hospitalization to death was 8 days.
Comorbidities such as diabetes, chronic lung disease, and chronic neurologic disease were significant risk factors associated with COVID-19 mortality. Clinical manifestations observed before death were abnormal heart rate intensity, systolic blood pressure, respiratory rate, oxygen saturation by pulse oximetry on room air, and altered mental status 16 . Although these two studies reported the clinical characteristics of the deceased in detail at the level of descriptive epidemiology, which contributed to the overall understanding of COVID-19 patients, their numbers of cases were relatively small and insufficient for associational inference. One study developed an evidence-based COVID-19 prognostic model for military personnel in Korea 17 . Although its generalizability is limited because it was developed for soldiers, age, body temperature, physical activity, history of cardiovascular disease, hypertension, visit to a region with an outbreak, feverishness, dyspnea, lethargy, and symptoms of chills were reported as significant predictors (overall C statistic: 0.963; 95% CI: 0.936-0.99) 17 .
Several studies have reported machine learning-based COVID-19 mortality prediction in the Korean population 13,18 .
An et al. developed a COVID-19 mortality prediction model using machine learning after recruiting 10,237 COVID-19 confirmed patients and 228 mortality cases between January 20, 2020 and April 16, 2020 13 . This prediction model used various variables, including socioeconomic status linked with the National Health Insurance Service (NHIS). However, specific clinical and epidemiological variables were lacking since that study focused on the linkage with NHIS data. For mortality prediction, LASSO and linear SVM were used in that study, with AUC values of 0.963 and 0.962, respectively. The most significant factors in the mortality prediction model using LASSO were old age, preexisting diabetes mellitus, and cancer. The most significant factors in random forest were old age, infection route (cluster infection or infection from personal contact), and underlying hypertension 13 . However, that model could not be immediately applied to the field or clinics due to the lack of specific clinical variables.
Das et al. also aimed to predict mortality among confirmed COVID-19 patients in South Korea using machine learning and to deploy the best-performing algorithm as an open-source online prediction tool for decision-making. They found that the logistic regression algorithm was the best performer in terms of discrimination 18 . Oh et al. investigated whether comorbid musculoskeletal disorders (MSDs) and pain medication use were associated with in-hospital mortality among patients with COVID-19. They found that MSDs were not associated with increased in-hospital mortality among patients with COVID-19 19 . Lee et al. found potential associations between physical activity and the risk of infection, severe illness from COVID-19, and COVID-19 mortality using a nationwide cohort from South Korea 20 .
Previous studies from other countries have reported that different clinical experiences can lead to substantial heterogeneity in the prognostic trajectory of confirmed COVID-19 patients, ranging from asymptomatic patients to those with mild, moderate, and severe disease forms with low survival rates 21,22 . A COVID-19 mortality prediction model was previously developed using machine learning on data from 3841 confirmed patients in New York, USA, recruited from March 9 to April 6, 2020 21 . Sex, age, race, oxygen saturation, COPD, hypertension, and diabetes were found to be significant variables in that model, with AUCs of 0.91 to 0.94. However, blood test results were not included in that model. In that study, the minimum oxygen saturation was emphasized as a central factor in mortality prediction 22 . A prediction model was developed after analyzing 53,001 ICU patients requiring mechanical ventilation as well as those diagnosed with pneumonia from the US Medical Information Mart for Intensive Care (MIMIC). When that model was applied to 114 confirmed COVID-19 patients, the AUCs for 12, 24, 48, and 72 h were reported to be 0.82, 0.81, 0.77, and 0.75, respectively 23 .
Our study probably used the largest data set to date to predict COVID-19 mortality involving specific clinical features of COVID-19 patients in Korea. The main advantage of our study was that we collected a wide range of clinical and epidemiological variables at the time when each patient was enrolled as a confirmed case. The results were obtained after a certain period of health system encounter or immediately after the diagnosis of COVID-19. Although we used only logistic regression analysis, both the development and validation sets showed high areas under the curve (0.9656 and 0.9684, respectively).
Although there have been studies with larger sample sizes or more extensive data collections, their results were difficult to interpret because no explicit algorithm was provided. In contrast, the detailed algorithm shown in our model makes it easy to interpret which factors are associated with a high mortality risk for an individual. In that context, our model has high practical value for risk stratification in the clinical field.
The main limitation of our study was the issue of validation. Although our dataset was relatively large and involved specific clinical features, we conducted only internal validation because no other dataset of similar size and with similar variables was available in Korea. Thus, model performance may be overestimated, and our results should be interpreted with caution.
Because of personal information protection regulations, COVID-19 mortality data in Korea are currently collected and managed only by a government agency, the Korean Disease Control and Prevention Agency (KDCA). Thus, no dataset other than the KDCA dataset was available in Korea. An external validation study using data from COVID-19 patients diagnosed after the study period is therefore required in the future.

Subjects and methods
Study population. Our study was based on the dataset established by the Korean Disease Control and Prevention Agency Central Disease Countermeasure Headquarters. Individual-level data were collected for 4049 COVID-19 patients whose quarantine release was confirmed among patients infected between January 20, 2020 and April 30, 2020. Complete nationwide inpatient and outpatient data of patients who visited any medical institution with a confirmed diagnosis of COVID-19 during the study period were obtained. COVID-19 confirmation was defined as a positive PCR-based clinical laboratory test for SARS-CoV-2. Personal information was deidentified in accordance with governmental guidelines for nonidentification measures, and the deidentification underwent an adequacy evaluation.
Risk factor measurement. The collected data used in our study included 41 variables categorized into seven subtypes as follows: (1) basic data (age, sex, death/quarantine released, length of stay between infection and death/quarantine released, pregnancy), (2) body index (height, weight), (3) initial examination finding (systolic/diastolic blood pressure, heart rate, body temperature), (4) clinical findings at hospitalization (history of fever, cough, sputum production, sore throat, runny nose/rhinorrhea, muscle aches/myalgia, fatigue/malaise, shortness of breath/dyspnea, headache, altered consciousness/confusion, vomiting/nausea, diarrhea), (5) comorbidity and past history (diabetes, hypertension, heart failure, chronic heart condition, asthma, chronic obstructive pulmonary disease, chronic kidney failure, cancer, chronic hepatic disease, rheumatism, autoimmune disease, dementia), (6) sickbed type and clinical severity, and (7) complete blood cell count. Each variable was either self-reported or recorded by professional health care providers. Mortality was defined as the death of a patient with COVID-19 during their encounter with the health system within the study period (January 1, 2020 ~ April 30, 2020). The data usage and study design of our study were approved by the Institutional Review Board of Ewha Womans University Seoul Hospital, and informed consent was obtained from each subject (SEUMC 2020-09-009). All methods were performed in accordance with relevant guidelines and regulations.
Statistical analysis. Risk scores for our COVID-19 mortality prediction model were developed by logistic regression analysis. We divided our data into two groups by random sampling: 60% (development set) for model development and the remaining 40% (validation set) for internal validation. All analyses were conducted using SAS version 9.4 (SAS Institute Inc., Cary, NC, USA).
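The general workflow, a random split followed by a risk score built from logistic regression coefficients, can be sketched in Python as below. This is a hedged illustration rather than the authors' SAS procedure: the odds ratios, the intercept, the variable names, and the function names are all hypothetical, and an exact 60/40 cut gives set sizes slightly different from the reported 2467 and 1582, so the authors' sampling evidently differed in detail.

```python
import math
import random


def split_60_40(ids, seed=0):
    """Randomly assign 60% of records to development, the rest to validation."""
    shuffled = list(ids)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    cut = round(0.6 * len(shuffled))
    return shuffled[:cut], shuffled[cut:]


# Hypothetical odds ratios for illustration only; NOT the values in Table 3.
ODDS_RATIOS = {"age_over_60": 8.0, "male": 1.5, "dyspnea": 3.0}
INTERCEPT = -5.0  # hypothetical baseline log-odds of death


def risk_score(patient):
    """Sum of log odds ratios (regression coefficients) for factors present."""
    return sum(math.log(value) for name, value in ODDS_RATIOS.items()
               if patient.get(name))


def predicted_mortality(patient):
    """Logistic transform of the intercept plus the risk score."""
    return 1 / (1 + math.exp(-(INTERCEPT + risk_score(patient))))


development, validation = split_60_40(range(4049))

low_risk = {"male": True}
high_risk = {"age_over_60": True, "male": True, "dyspnea": True}
print(len(development), len(validation))
print(round(predicted_mortality(low_risk), 4),
      round(predicted_mortality(high_risk), 4))
```

Because the score is a sum of log odds ratios, each factor's contribution can be read off directly, which is the interpretability property of a logistic model.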